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ABSTRACT 

In this paper we investigate the convergence properties of 
a variant of the Covariance Matrix Adaptation Evolution 
Strategy (CMA-ES). Our study is based on the recent the- 
oretical foundation that the pure rank-/i update CMA-ES 
performs the natural gradient descent on the parameter space 
of Gaussian distributions. We derive a novel variant of the 
natural gradient method where the parameters of the Gaus- 
sian distribution are updated along the natural gradient to 
improve a newly defined function on the parameter space. 
We study this algorithm on composites of a monotone func- 
tion with a convex quadratic function. We prove that our 
algorithm adapts the covariance matrix so that it becomes 
proportional to the inverse of the Hessian of the original 
objective function. We also show the speed of covariance 
matrix adaptation and the speed of convergence of the pa- 
rameters. We introduce a stochastic algorithm that approx- 
imates the natural gradient with finite samples and present 
some simulated results to evaluate how precisely the stochas- 
tic algorithm approximates the deterministic, ideal one un- 
der finite samples and to see how similarly our algorithm 
and the CMA-ES perform. 

Categories and Subject Descriptors 

G.1.6 [Numerical Analysis]: Optimization — Global op- 
timization, Gradient methods, Unconstrained optimization; 
F.2.1 [Analysis of Algorithms and Problem Complex- 
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1. INTRODUCTION 

The Covariance Matrix Adaptation Evolution Strategy 
(CMA-ES, |13H15| ) is a stochastic search algorithm for non- 
separable and ill-conditioned black-box continuous optimiza- 
tion. In the CMA-ES, search points are generated from a 
Gaussian distribution and the mean vector and the covari- 
ance matrix of the Gaussian distribution are adapted by 
using the sampled points and their objective value ranking. 
These parameters' update rules are designed so as to en- 
hance the probability of generating superior points in the 
next iteration in a way similar to but slightly different from 
the (weighted) maximum likelihood estimation. Adaptive- 
ESs including the CMA-ES are successfully applied in prac- 
tice. However, their theoretical analysis even on a simple 
function is complicated and linear convergence has been 
proven only for simple algorithms compared to the CMA- 
ES [BUTT]. 

Resent studies |TJ[TTJ demonstrate the link between the 
parameter update rules in the CMA-ES and the natural 
gradient method, the latter of which is the steepest as- 
cent /descent method on a Riemannian manifold and is often 
employed in machine learning [2]|4, 8 21-23]. The natural 
gradient view of the CMA-ES has been developed and ex- 
tended in [5] and the Information-Geometric Optimization 
(IGO) algorithm has been introduced as the unified frame- 
work of natural gradient based stochastic search algorithms. 
Given a family of probability distributions parameterized by 
9 £ O, the IGO transforms the original objective function, 
/, to a fitness function, J, defined on O. The IGO algorithm 
performs a natural gradient ascent aiming at maximizing J. 
For the family of Gaussian distributions, the IGO algorithm 
recovers the pure rank-/x update CMA-ES [14], for the fam- 
ily of Bernoulli distributions, PBIL [JJ is recovered. The 
IGO algorithm can be viewed as the deterministic model of 
a recovered stochastic algorithm in the limit of the number 
of sample points going to infinity. 

The IGO offers a mathematical tool for analyzing the be- 
havior of stochastic algorithms. In this paper, we analyze 
the behavior of the deterministic model of the pure rank-p 
update CMA-ES, which is slightly different from the IGO 
algorithm. We are interested in knowing what is the tar- 
get matrix of the covariance matrix update and how fast 
the covariance matrix learns the target. The CM A is de- 
signed to solve ill-conditioned objective function efficiently 
by adapting the metric — covariance matrix in the CMA- 
ES — as well as other variable metric methods such as quasi- 
Newton methods [20]. Speed of optimization depends on the 
precision and the speed of metric adaptation to a great ex- 



tend. There is a lot of empirical evidence that the covariance 
matrix tends to be proportional to the inverse of the Hessian 
matrix of the objective function in the CMA-ES. However, 
it has not been mathematically proven yet. We are also in- 
terested in the speed of convergence of the mean vector and 
the covariance matrix. Convergence of the CMA-ES has not 
been reported up to this time. We tackle these issues in this 
work. 

In this paper, we derive a novel natural gradient algorithm 
in a similar way to the ICO algorithm, where the objective 
function / is transformed to a function J in a different way 
from the ICO so that we can derive the explicit form of the 
natural gradient for composite functions of a strictly increas- 
ing function and a convex quadratic function. We call the 
composite functions monotonic convex- quadratic- composite 
functions. The resulting algorithm inherits important prop- 
erties of the IGO and the CMA-ES, such as invariance under 
monotone transformation of the objective function and in- 
variance under affine transformation of the search space. We 
theoretically study this natural gradient method on mono- 
tonic convex-quadratic-composite functions. We prove that 
the covariance matrix adapts to be proportional to the in- 
verse of the Hessian matrix of the objective function. We 
also investigate the speed of the covariance matrix adapta- 
tion and the speed of convergence of the parameters. 

The rest of the paper is organized as follows. In Section [2] 
we propose a novel natural gradient method and present a 
stochastic algorithm that approximates the natural gradient 
from finite samples. The basic properties of both algorithms 
are described. In Section [3] we study the convergence prop- 
erties of the deterministic algorithm on monotonic convex- 
quadratic-composite functions. The convergence of the con- 
dition number of the product of the covariance matrix and 
the Hessian matrix of the objective function to one and its 
linear convergence are proven. Moreover, the rate of conver- 
gence of the parameter is shown. In Section [4] we conduct 
experiments to see how accurately the stochastic algorithm 
approximates the deterministic algorithm and to see how 
similarly our algorithm and the CMA-ES behave on a con- 
vex quadratic function. Finally, we summarize and conclude 
this paper in Section [S] 

2. THE ALGORITHMS 

We first introduce a generic framework of the natural gra- 
dient algorithm that includes the IGO algorithm. 

The original objective is to minimize / : X — > R, where 
X is a metric space. Let T and [i be the Borel <r-field 
and a measure on X. Hereunder, we assume that / is /i- 
measurable. Let v represent any monotonically increasing 
set function on T , i.e., v(A) ^ v(B) for any A, B (E J- s.t. 
A C B. We transform / to an invariant cost function de- 
fined as Vf : x h-> v[y : f(y) ^ f[x)]. Given a family of prob- 
ability distributions Pg on X, we define a gttasi-objective 
function J on the parameter space O as the expected value 
of Vf over Pg, namely 

j(e) = E x ^ e [Vf(x)] . 

Our algorithm performs the natural gradient descent on a 
Riemannian manifold (<d,Tg) equipped with the Fisher met- 
ric Tg. The Fisher metric is the unique metric that does not 
depend on the choice of parameterization [3]. The natu- 
ral gradient — the gradient taken w.r.t. the Fisher metric — is 
given by the product of the inverse of the Fisher information 



matrix Tg and the "vanilla" gradient VJ(#) of the function. 
Therefore, the natural gradient of J is Ig 1 \7J(8) and the 
parameter update follows 



Qt + l 



(1) 



where rf is the learning rate. 

2.1 Deterministic NGD Algorithm on R d 

In the following, we focus on the optimization in R d . Thus, 
X = R d , fj, is the Lebesgue measure ^tLcb on R d , and T is 
the Borel cr-field B d on R d . 

We choose as the sampling distribution the Gaussian Pg 
parameterized by 8 £ O, where the mean vector m(8) is 
in R and the covariance matrix C(8) is a symmetric and 
positive definite matrix of dimension d. 

We define the invariant cost Vf(x) by using the Lebesgue 
measure ^ Leb as V f (x) = uliily : f(y) ^ /(»)]• Then, 
the infimum of J (8) — Ex~p„ [V/(X)] is zero located on a 
boundary of the domain O where the mean vector equals 
the global minimum of / and the covariance matrix is zero. 

The choice of the parameterization of Gaussian distribu- 
tions affects the behavior of the natural gradient update 
JTJ) with finite learning rate 77', although the steepest direc- 
tion of J on the statistical manifold O is invariant under 
the choice of parameterization. We choose the mean vec- 
tor and the covariance matrix of the Gaussian distribution 
as the parameter as well as are chosen in the CMA-ES and 
in other algorithms such as EMNA [18] and cross entropy 
method [10]. Let 8 — [m T , vec(C) T ] T , where vec(C) be the 
vectorization of C such that the (i, j)ih element of C corre- 
sponds to i + d(j — l)th element of vec(C) (see [E]). Then 
the Fisher information matrix has an analytical form 



c~ 








Kc-^c- 1 ) 



(2) 



where <S> denotes the Kronecker product operator. Under 
some regularity conditions for the exchange of integration 
and differentiation we have 



VJ(0) = Ex~p 4 [V f (X)X7l(8; X)} , 



(3) 



where l(8;x) = lnpg(x) is the log-likelihood. The gradient 
of the log-likelihood Vl(8; x) can be written as 



Vl{0;x) 



C-^x- 
z(C~ 1 (x-m)(x- 



Then, the natural gradient I g 1 X7J(8) 
at 9 = 8 l can be written by part 

<5m* = E x ~p et [V f {X)(X - m*)] 



mfC-'-C- 1 )] ■ (4) 
[Sm T ,vec(SC) T ] T 



5C l = E.} 



[VfmUX-m'XX-mY-C*)] 



With different learning rates for mean vector and covariance 
matrix updates, the natural gradient descent (TTJ reads 



m* — VmSrn ', C t+1 = C* — rfcSC 1 



(5) 



We refer to ([5]) for the deterministic natural gradient descent 
(NGD) method. 

2.2 Stochastic NGD Algorithm on R d 

When VJ(9) is not given, we need to estimate the gra- 
dient. We approximate the natural gradient and simulate 
the natural gradient descent as follows. Initialize the mean 



vector m° and the covariance matrix C° and repeat the fol- 
lowing steps until some termination criterion is satisfied: 

1. Compute the eigenvalue decomposition of C , [B, D] — 
ere (C*), where B is an orthogonal matrix and D is a 
diagonal matrix such that C* = BDB T . 

2. Compute the square root of C f , VC 1 = B\fDB T . 

3. Generate normal random vectors zi z n ~A/"(0,/d)- 

4. Compute Xi = m* + VC*Zi, for i = 1, . . . ,n. 

5. Evaluate the objective values f{xi), . . . , f(x n ); 

6. Estimate Vf(xi) as 

(27r) d/2 det(D) 1/2 



V f (xi) 



E 



exp 



7. Compute the baseline 6 = ^/(^O/ 71 - 

8. Compute the weights Wi = (Vf(xi) — b)/n. 

9. Estimate the natural gradient <5m* and <5C' as 

n 

Sm t = Wi(a;i — m 1 ) 

n 

5C" = ^2 Wi((xi — m t )(x i — m') T — C*) . 



(6) 



10. Compute the learning rates r\ m and r;^. 

11. Update the parameters as m t+1 = m' — 7/ m <5m* and 

= C l -■qc&C 1 . 

We refer to this algorithm for the stochastic NGD algorithm. 

This algorithm generates n samples Xi from M{m t , C) 
in steps 1-4 and evaluates their objective values in step 5. 
In step 6, the invariant costs Vf(xi) are evaluated. The 
estimates Vf(xi) are obtained as follows. By definition we 
have 

Applying Monte-Carlo approximation we have 

2/d 



(7) 



Since p 6 t(xj) = ((27r) d det(D)) 7 exp(||Zj|| 2 /2), we have 
the estimates Vf(xi) in step 6. Step 7 computes the baseline 
b that is often introduced to reduce the estimation variance 
of gradients while adding no bias [12]. We simply choose 
the mean value of the Vf(xi) as the baseline. Replacing 
the expectation in (JS| with the sample mean and adding the 
baseline (in step 8) we have the Monte-Carlo estimate of the 
natural gradient in step 9. Finally in step 11, we update the 
parameters along the estimated natural gradient with the 
learning rates computed in step 10. The learning rates are 
chosen in the following so that they are inverse proportional 
to the largest eigenvalue of the following matrix 



z t = ( C .*)-l/2 5Ct(C/ t ) -l/2 = ^ w .(^T_ /d 



(8) 



2.3 Difference from the IGO 

The difference between the IGO algorithm and our de- 
terministic algorithm is that the invariant cost in the IGO 
algorithm is defined by negative of the weighted quantile, 
— w(P e t[y : f(y) ^ f (x)\), where w : [0, 1] h-> K is non- 
increasing weight function. Since the quantile P g t [y : f(y) < 
f{x)] depends on the current parameter 9*, Vf(x) for each 
x in the IGO algorithm changes from iteration to iteration, 
whereas it is fixed in our algorithm. This property makes 
our algorithm easier to analyze mathematically. 

The difference between our stochastic algorithm and the 
pure rank-// update CMA-ES [14] is the same as the differ- 
ence between the deterministic one and the IGO algorithm. 
The pure rank-/i update CMA-ES approximates the quantile 
value Pgt [y : f(y) ^ f{x)} by the number of better solutions 
divided by the number of samples n, Rijn= \{xj : f(xj) ^ 
f(xi)}\/n. Therefore, the pure rank-^i update CMA-ES sim- 
ulates the same lines as the stochastic NGD algorithm de- 
scribed in Section \2. 21 with the weights u>i = w(Ri/n)/n. 

In Section 0] we compare the stochastic NGD algorithm 
with the pure rank-/x update CMA-ES where 



1/K4J 




if Ri/n < Ln/4J 
otherwize. 



(9) 



2.4 Basic Properties 

Invariance. Our algorithms inherit two important invari- 
ance properties from the IGO and the CMA-ES: invariance 
under monotonic transformation of the objective function 
and invariance under afnne transformation of the search 
space (with the same transformation of the initial param- 
eters). Invariance under monotonic transformation of the 
objective function makes the algorithm perform equally on 
a function / and on any composite function g o f where g 
is any strictly increasing function. For example, the convex 
sphere function f(x) = \\x\\ 2 is equivalent to the non-convex 
function f(x) = ||a:|| 1 ' /2 for this algorithm, whereas conven- 
tional gradient methods, e.g. Newton method, assume the 
convexity of the objective function and require a fine line 
search to solve non-convex functions. This invariance prop- 
erty is obtained as a result of the transformation / Vf. 
Invariance under afnne transformation of the search space 
is the essence of variable metric methods such as Newton's 
method. By adapting the covariance matrix, this algorithm 
attains universal performance on ill-conditioned objective 
functions. 

Positivity. The covariance matrix of the Gaussian dis- 
tribution must be positive definite and symmetric at each 
iteration. The next proposition gives the condition on the 
learning rate rf c such that the covariance matrix is always 
positive definite symmetric. 

Proposition 1. Suppose that the learning rate for the 

in the de- 



covariance update rf c < Aj~ (vC* 5C t vC^ 
terministic NGD algorithm, where Ai(-) denotes the largest 
eigenvalue of the argument matrix. If C° is positive defi- 
nite symmetric, C* is positive definite symmetric for each 
t. Similarly, if rf c < X^ 1 (Z t ) in the stochastic NGD al- 
gorithm, where Z l is defined in ((5), and if C° is positive 
definite symmetric, the same result holds. 

Proof. Consider the deterministic case Suppose 



that C' is positive definite and symmetric. Then, 



c t+1 = vc* i d - nhvc* scwc 



by the assumption, all 
C t 1 is smaller than one. 



Since rf c < \- l {VC* SC'VCT' 
the eigenvalues of rfc^ 5C* 
Thus, the inside of the brackets is positive definite sym- 
metric and hence C t+1 is positive definite symmetric. By 
mathematical induction, we have that C* is positive definite 
and symmetric for all t ^ 0. The analogous result for the 
stochastic case is obtained in the same way. □ 

Consistency. The gradient estimator ([6]) is not necessarily 
unbiased, yet it is consistent as is shown in the following 
proposition. Therefore, one can expect that the stochastic 
NGD approximates the deterministic NGD well when the 
sample size n is large. Let V : J 1— > I g ~ 1 VJ be the natural 
gradient operator. 

Proposition 2. Let Xi, ... ,X n be independent and iden- 
tically distributed random vectors following Pg. Let V/(x) 
and Gg = [(Sm t ) T , vec(<5C') T ] T be the invariant cost @ 
and the natural gradient where x\,...,x n are replaced 
with X\ , . . . , X n . Suppose that 



E[V f (X) 2 } < 00. 



(10) 



Then, Gg ^4' VJ(#), where —i represents almost sure con- 
vergence. 

PROOF. _By the Cauchy-Schwarz inequality we have that 
E[\\Vf(X)Vl{9;X)\\} 2 <E[V f (X) 2 ]E[\\Vl(6;X)\\ 2 ]. Notethat 
E[||VZ((9;X)|| 2 ] = Tr(X e _1 ) < 00. By Jensen's inequality we 
have that E[V f (X)] 2 < E[V f {X) 2 ]. Therefore, JTDJl implies 



E[\\Vf(X)Vl(6;X)\\] < 00 and 
E[V f (X)] < 00 . 



(11) 
(12) 



Define h n (x) — Vf 



Vf(x) and decompose Gg as 



1 n 1 n 

G n e = -V V f (Xi)Vl(e-,Xi)+ - V hn(Xi)Vl(0;Xi) 

i=l i=l 

-(^VfixAf^VWXt)). (13) 



By (|11[) and the strong law of large numbers (LLN), the first 
summand converges to E[Vf(X)X7l(9; X)] — VJ(#) almost 
surely as n — > 00. So we have to show that the second term 
and the third term of (|13|) converge almost surely to zero. 
First, we show the following almost sure convergence 



1 " 

— y h n (Xi) as n — > 00. 

n — J 



(14) 



By the definition of Vf(x), we have 



lim Vf (x) — I lim — 

n — V rsn I n. — ^ nci T) ' ^ 



1 1 f(X j )^f(x) 



2/d 



and since (I12|l implies /J,Leb[v '■ f(y) ^ f( x )] < 00 almost 
everywhere, we have by LLN 

= $l[y-f{y)^m] = Vf{x) 



almost surely and almost everywhere in x. This implies 
h n (x) -4 almost everywhere in x. 
For m < n, we have 



I 1 " 1 " 

lim —7 h„(Xi)\ ^ lim — y s\lp\hj(Xi 



1 - 

^ lim —y^ sup\hj(Xi)\ 

n— »oo n. — ' --^ 



(15) 



Since h n (x) ^4 almost everywhere in x as n — >• 00, we 
have swpj^ m \hj(x)\ -4' almost everywhere in 1 as m -> 00. 
By the Lebesgue's dominated convergence theorem we have 
E[sup J>m |/ij(X)|] — > as m — >• 00. Therefore, we have that 
E[sup .^ m |/ij < 00 for m large enough. Then, by apply- 
ing LLN, we have that the right most side of (|15f) converges 
to E[sup J>m |ftj(X)|] as n — >• 00 and this expectation con- 
verges to as m — > 00. This ends the proof of (|14p . 

Now we can obtain the almost sure convergence of the 
third term of (|13jl to zero. Indeed, the almost sure con- 
vergence (|14p implies that the limit lmin-i-oo J^ILi VH-^OA 1 
agrees with linin-joo J^jLi an d we have from (fl2"Tl 

and LLN that T,"=iVf(X l )/n a -4 E[V/(X)] < 00. Also, 

by LLN we have that 5^™ =1 VZ(#; Xi)jn as n — > 00. 
Therefore, the third term of (|13l) converges to zero almost 
surely. 

To show the convergence of the second term of (|13[1 to 
zero, we apply the Cauchy-Schwarz inequality to it and we 
have 



E 



h n {Xi)vi{e-,Xi 



E 



h n (Xj) 2 



E 



ivmxi) 



By LLN we have that the second term of the right hand side 
converges to E[|| VZ(0; -X")|| 2 ] = Tr^^ 1 ). So we have to prove 
that the first term on the right hand side converges to zero 
almost surely. The proof of this convergence is done in the 
same way as above with h\ replacing h n . □ 



We remark that (|12[1 is the necessary and sufficient condi- 
tion for J(8) to exist and that (|ll|l is a sufficient condition 
for the exchange of integral and differentiation used in ([3}. 
See e.g. [5] Theorem 16.8]. 

3. CONVERGENCE PROPERTIES OF THE 
DETERMINISTIC NGD ALGORITHM 

We investigate the convergence properties of the determin- 
istic NGD algorithm )5( on a monotonic convex-quadratic 
composite function f(x) = g(x T Ax), where g is any strictly 
increasing function and A is a positive definite symmetric 
matrix. 



Proposition 3. The natural 

Ig l VJ{6) 



ent can be written as 



CAm 
vec(CAC) 



Proof. Since HLcb[y '■ f{y) ^ f( x )] is equivalent to the 
volume of the ellipsoid {y : y 1 Ay ^ x T Ax}, we have that 

2 



PLeb[y : }{y) < f(x)] 



-V d (Vx^Ax) 



det(A) 

where Vd(r) denotes the volume of the sphere with radius 
r and is proportional to r d . Therefore Vf(x) = fJ^^ly ■ 



f(y) ^ f( x )] x x T Ax. Since the proportionality constant is 
independent of x, we have 

j(e) = E x ~ Pe [Vf(x)] 

oc E x ~p e [X T AX] = m T Arn + Tr{AC). 
Differentiating the both side of the above relation, we have 

VJ(0) oc [ 2A ™ . 
y ' vec(A) 

Premultiplying by Tg 1 , we obtain the intended result. □ 

Now the deterministic NGD algorithm on g{x T AX) is im- 
plicitly written as 

m t+1 — m — rf m &rri , 5rn = cC* Am (16) 

C t+1 = C t -rf c 5G t , 5C i = cC t AC t , (17) 

where c > is the proportionality constant appearing in the 
proof of Proposition [3] 

In the following, we work on the following assumption: 
There are 7m, mm > and 7c, mm > such that 

7m,min < ffm M ( (C ) " 1 5C* ) «S 1, (18) 

7c,min < rf c X 1 {{C t )- 1 8C t ) s= 1/2. (19) 

These assumptions are satisfied, for example, if if m and rf c 
are set for each iteration so that rf m — rf c = a/X\ ((C t )~ 1 5C t ). 
In this case the natural gradient can be considered to be nor- 
malized by Ai((C*)~ SC 1 ) and the pseudo-learning rate is 
a. 

The next theorem states that the covariance matrix con- 
verges proportionally to the inverse of the Hessian matrix. 

Theorem 4. Assume ( fl9)) . The condition number of C* A 
converges to one with the rate of convergence 

,. Cond(C i+ M) - 1 1 - 2 7C ,min , on ^ 

hmsup — — A (ntA\ i ^ ! ' ( 20 ) 

t^oo Cond(C*yl) - 1 1 - 7 c,min 

Moreover, we have an upper bound 
Cond(C ,t A) < 1 + (}__2£I£^A (Cond(C*°yl) - 1). (21) 

V 1 — 7C,min / 

If the limit fciim = limt-»oo n t c \i((C t )~ 1 5C t ) exists, 7 c,mi n 
is replaced with 7c,iim (1201) . 

Proof. Since A is positive definite and symmetric, there 
exists the square root \J~A. Premultiplying and postmulti- 
plying both side of covariance matrix update (|17[) by s/cA, 
we have 

cVAC t+1 VA = cVAC'VA-rj^icVAC'VAf. 

Since c\f~AC t \f~A is positive definite and symmetric, there 
exists an eigenvalue decomposition Q t A t (Q t ) T , where the 
diagonal elements of A' = diag(A^ , . . . , X a ) are the eigenval- 
ues of c\ // ~AC t \ // ~A and each column of Q t is the eigenvector 
corresponding to each diagonal element of A*. Then, 

cVAC t+1 VA = Q* (A* - ^(A*) 2 ) (Q t ) T . 

This means, Q l also diagonalizes c\/AC t+1 \J~A. By math- 
ematical induction we have that an orthogonal matrix Q 
which diagonalizes c\/A~C \fA diagonalizes c\fAC t ^/~A for 
any t ^ and we have 



Next, we show that the condition number of A* converses 
to 1 as t — > oo. Remember 5C t = cC l AC 1 . We have 
\x{{C t Y x bG t ) = Ai(eAC') = XxipjA&yTK) = Ai(A'). 
Then, by assumption (llQf) we have 



7C,min < 7?C>l(A') < 1/2. 



(23) 



Moreover, since Ai(A') ^ A* for any i, we have ??c(Ai + Aj) ^ 
1 for any i, j. 

Suppose A* ^ Aj without loss of generality. From (|22|l and 
the inequality rf c (X\ + A*) ^ 1, we have 

A* +1 - A* +1 = A|(l - VcXl) - A*(l - Vc^) 
= (l-J?c(A* + Aj))(A*-A*) ^0 



with equality holding if and only if A* = \j. Therefore, if 
A- > A* , then A* +1 > X* +1 , which implies that if ith and jth 
diagonal elements of A are the maximum and minimum 
elements, ith and jth elements of A* are also the maximum 
and minimum elements of A 4 . Without loss of generality 
we suppose A* ^ Xj for any i ^ j for all t ^ 0. Then, 
Ai/A* d = Cond(A') 
have 

%t+l %t+l 



\t+i 
Cond(C*+M)-l 



Cond(C'A). According to (gU we 

Xijl-nhxD-XUl-vh^) 
A* d (l-«) 



(Af-A$) l-^(Ai+A* d ) 



(24) 



Cond(C*A)-l 



Since 

l-vhM + Xi) _ 

(1-t^A*) 1- 

we have 

Cond(C t+ M) 



1-77^1(1 + X\/ X\) ^ l-2rf c X\ 



Cond(C t yl) 



1 1 
— € — 



Vc 



A', 



(25) 



Moreover, since the right-hand side of the above inequality 
is maximized when tiqX\ is minimized and n^X\ is bounded 
from below by 7c,min because of (|23p . we have 



Cond(C t+ M) 



Cond(C"A) 



1 1 
— £ 



27c,, 



1 - JC,n 



This implies limt-joo Cond(C'j4) = 1. The rate of conver- 
gence (|20|) and the upper bound (121[) are immediate conse- 
quences of the above inequality. 

If the limit 7c,iim exists, it is easy to see from (|25|) that 
7c,min can be replaced with jc,lim in ((20)) . This completes 
the proof. □ 

Note that if rf c = a/X 1 ((C t )~ 1 5C t ) and a < 1/2, we have 
that 7c,iim = 7c,min = ct- We have from ((24)) that 



Cond(C t+1 A) = CondiC'A) 



1 -aCond-^C*^) 



(26) 



A *+i = A * 



(22) 



and the rate of convergence becomes (1 — 2a) /(l — a). 

The next theorem states the global convergence of m and 
C and the speed of the convergence. In the following, we 
let ||M|| denote the Frobenius norm of M, namely ||M|| = 
Tr 1/2 (M T M). 



Theorem 5. Assume (fl9|) and (JT5J. Then, \\m t \\ and 
|C || converge to zero with the rate of convergence 



lim sup ■ 



< 1 - 7«.: 



(27) 



where k is either m or C and k* is either m* or C . If 
the limit 7 re ,um = limt-Hx, n l R Ai((C t )~ 5C l ) exists, 7 K , m i n is 
replaced with 7 K ,ii m in (J2TJ . 

Proof. Let <7j(-) denote the zth largest singular value of 
the argument matrix. According to J. von Neumann's trace 
inequality [19] we have |Tr(MiM 2 )| < Y%=\ o-<(Mi)<Ti(Af 2 ) < 
cri(Mi) 53 4=1 0i(M2), where Mi and A72 are any matrices in 
R dxd . Let M € R dxd be nonnegative definite and S £ R dxd 
is nonnegative definite symmetric. From the above inequal- 
ity, we have 

\\MSf = Tv{SM T MS) = Tv(M T MS 2 ) 

d 

sC a 1 {M T M)Y,^(S 2 ) = aj{M)\\S\\ 2 . 
i=l 

Applying this matrix norm inequality and the vector norm 
inequality ||A7a;|| 2 < ai(M) 2 ||x|| 2 to |16} and p7|l. we have 



I K II 



In light of Theoremgl we have that limt->-oo G t A/Xi(C t A) = 
Id- Then, from the assumptions (|18|l and ()19|) we have 

limsupo"i(7 — cqJJ 1 'A) 
t 

= lim sup <Ti (l - rf K X 1 (cC t A) - J^^j ) ^ 1 _ 7«,min. 



This implies linear convergence of k with rate of convergence 
at most 1 - -y K 

, mill - 

If the limit 7 K ,u m exists, we can easily see from the above 
inequality that 7 K . m i n is replaced with 7 Kl u m in (|27p . This 
ends the proof. □ 

Now we can see the importance of the covariance matrix 
adaptation quantitatively. Let rf^ = rf c — a/\i((C t )~ 1 SC t ). 
Then, the covariance matrix becomes proportional to the in- 
verse of the Hessian at the speed given by (|26[1 and the rate 
of convergence of the parameter becomes 1 — a. Meanwhile, 
if the covariance matrix is restricted to a product of a scalar 
v and the identity matrix, C* = v*I, then the rate of con- 
vergence is in [1 — a, 1 — a Therefore, the 
rate of convergence becomes close to one in the worst case 
if Cond(A) > 1. 

From Theorem 0] we know that the deterministic NGD 
algorithm learns the inverse of the Hessian. The conver- 
gence of the covariance matrix to the inverse of the Hessian 
matrix in the CMA-ES has been anticipated but it has not 
been proven. Theorem [4] demonstrates this anticipation af- 
firmatively for the deterministic NGD algorithm. Theorem[S] 

1 If the covariance matrix is restricted to a diagonal ma- 
trix, the target matrix is diag(y4) = diag(j4i,i, . . . , Ad,d), 
i.e. lim t Cond(C 4 diag(yl)) = 1. the rate of convergence 
is in [a,j (7 — a Cond(^4)), <7i(7 — a Cond(yl))], where A = 
diag(yl) -1 A We omit the derivation due to the space limi- 
tation. 



exhibits the linear convergence of the parameters. This im- 
plies that the rate of convergence of the expected objective 
value J(8) oc m T Am + Tr(CA) is also linear and equals to 
the rate of convergence of ||C* || . 

4. NUMERICAL SIMULATION 

The results in the previous section are for the determin- 
istic (ideal) NGD algorithm. Thanks to Proposition [5] we 
can expect that the stochastic NGD algorithm proposed in 
Section 12.21 approximates the deterministic one arbitrarily 
close as the sample size n is taken sufficiently large. In this 
section, we evaluate how well the stochastic variant with 
finite sample size approximates the deterministic one on a 
quadratic function. 

We consider the 20-dimensional ellipsoid function 

fix) ^Y, 10 ^ 1 ^ id = 20). 
i=i 

Note that the ellipsoid function is separable and convex but 
our algorithm does not exploit the separability and convex- 
ity. The eigenvalues (diagonal elements) of the Hessian ma- 
trix of the ellipsoid function range in [1, 10 6 ]. We set the 
initial parameters as m° = (0, . . . , 0) T and C° = Id- 
We design the learning rates as 



Vm = 



and r)c = 



cc 



Here, Z l is a matrix defined in 



c c < 1. 



4.1 Effect of Sample Size and Learning Rate 

First, we investigate the effect of the sample size n and the 
coefficient cc of the learning rate rf c - We try the following 
sample sizes: \d 1/2 ], d, \d s/2 ], d 2 , \d 5/2 ], d 3 . 

Figure Q] illustrates the slope of the condition number 
Cond(C ^4) and the theoretical curve l|26p and the slope 
of the expected objective function value Ex~p 8 , [X T AX] = 
(m 4 ) 1 Am 1 + Tr(C'^4), averaged over 50 independent trials. 
When the sample size is larger, we see the closer performance 
to the theoretical result. When n — d 3 and cc = 0.1, the 
convergence curve of the condition number approximated 
well the theoretical curve and the final condition number is 
Cond(C*A) w 1.1. When n = \d 1/2 ] and c c = 0.1, it takes 
more than 30 times longer to learn the covariance matrix 
and the final condition number becomes Cond(C'yl) w 4.0, 
although the stochastic algorithm still works successfully. 
We attain a little higher condition numbers when we choose 
larger learning rates cc = 0.5, 1.0. For example, the final 
condition numbers are Cond(C 4 A) w 1.3 for n = a? and 
c c = 0.5, and Cond(C'A) w 1.6 for n = d 3 and c c = 1.0. 
This is because smaller learning rates have more effect of 
averaging the natural gradient estimates over iterations and 
reducing the estimation variance. 

Note that we observe a slightly slower adaptation of the 
covariance matrix at the beginning in case that we set m° — 
(10, ... , 10), although the adaptation behavior (|26p does not 
change in theory. See Figure [2] This attributes to the esti- 
mation precision of Vj. If the squared Mahalanobis distance 
(m*) T (C*) _1 m* between the origin (the global optimum) 
and the current mean with respect to C is larger, the func- 
tion landscape around m' looks more like linear function. 
Then Vfixi) are far from the exact values, especially in case 
a small sample size is chosen. 



solid line: m" = (0 0), dashed line: m u = (10 10) 




10 

t: no. of iteration 



Figure 1: Averages and standard deviations of the 
condition numbers Cond(C'yl) for cc = 0.1, 0.5, and 
1.0 and the expected objective function values for 
cc = 0.1. Theoretical curves (|26|l of the condition 
number are also illustrated with dashed lines. All 
the lines are cut after first reach of the expected 
objective value to 10~ 10 . Some results are omitted, 
for example n = d for cc = 0.5, because numerically 
unstable computation occurs during search. 




10 

t: no. of iteration 



Figure 2: Averages and standard deviations of the 
change of the condition numbers for cc = 0.1 for 
n = d, d 2 , d 3 with initial mean vector m = (0, ... ,0) 
or m = (10, . . . , 10). Other settings are the same as in 
Figure [TJ 



4.2 Comparison with Rank-^ update CMA-ES 

Finally, we study how well this stochastic algorithm simu- 
lates the CMA-ES. We test the pure rank-/^ update CMA-ES 
with weight scheme ([9]). We set the learning rates follow- 
ing [S] 



= 1, 



Vc = 



(d + 2) 2 



where fx m = 1/ Y^7=i w i- We c h° ose cc for our algorithm so 
that the speed of adaptation for each model is almost the 
same. 

Figure [3] shows the results for each method for n — d 
and n = d 2 . In both case, we confirm similar behaviors of 
the pure rank-/i update CMA-ES and our algorithm despite 
their dissimilar weight-value settings. The similar change of 
performance illustrated in Figure [2] is also observed for the 
pure rank-/i update CMA-ES. From this result, we conclude 
that it is possible to estimate the performance of the pure 
rank-/i update CMA-ES by our natural gradient algorithm, 
which is theoretically more attractive. 

However, note that the pure rank-^i update CMA-ES is 
not the standard CMA-ES [13] and the standard CMA- 
ES performs better than the pure rank-/i update CMA-ES. 
The standard CMA-ES employes so-called evolution paths 
to adapt the covariance matrix and the global scale of the 
covariance matrix, which is called step-size in the CMA- 
ES context. Moreover, the standard CMA-ES employes 
weighted recombination, where different values are assigned 
to the weights for Ri ^ [n/2\, which is only slightly bet- 
ter than intermediate recombination © and even similar to 
our setting. Furthermore, the similar performance observed 
is only on a quadratic function. If there are certain functions 
which distinguish our algorithm from the (rank-/i) CMA-ES, 
this may help to understand both the NGD algorithm and 
the CMA-ES. Further study on these topics is required. 

5. CONCLUSION 

We have proposed a novel natural gradient descent (NGD) 
method where the objective function is transformed to a 
function defined on the parameter space of probability dis- 




t: no. of iteration ,. n0 of iteration 

Figure 3: Averages and standard deviations of the 
change of the condition numbers and the change of 
the expected objective function values for n = d and 
cc = 0.14 on the left, and for n = d 2 and cc = 0.75 on 
the right. Other settings are the same as in FigureflJ 

tributions. We have proven that the deterministic NGD 
method learns the inverse of the Hessian of the original 
objective function that is any monotonic convex-quadratic- 
composite function. Linear convergence of the mean vector 
and the covariance matrix has been also proven. The nu- 
merical results for the stochastic NGD algorithm have shown 
that the stochastic algorithm approximates the deterministic 
one well when the sample size is sufficiently large. Moreover, 
we have confirmed that the stochastic NGD algorithm and 
the pure rank-/i update CMA-ES behave very similarly on 
a quadratic function. 

The contribution of the paper is to derive a novel NGD al- 
gorithm that can be viewed as a variant of the CMA-ES from 
the first principle of information geometry. This allows us to 
analyze the algorithm theoretically. Our theoretical results 
in Section|3]imply that there is at least one weight- value set- 
ting in the CMA-ES such that the covariance matrix learns 
the inverse of the Hessian of the objective function. More- 
over, since our algorithm does not only share most of the im- 
portant properties of the rank-/i update CMA-ES, but also 
is confirmed to perform similarly to the pure rank-/i update 
CMA-ES on a quadratic function by numerical simulations, 
we could study our algorithm to find out limitations of the 
pure rank-jj, update CMA-ES and to discover a way to im- 
prove the CMA-ES. 
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